Abstract

While all the information required for the folding of a protein is
contained in its amino acid sequence, one has not yet learned how to
extract this information to predict the three-dimensional, biologically
active, native conformation of a protein whose sequence is known.
Simple model simulations of the folding of proteins indicate that this
phenomenon is essentially controlled by conserved (native) contacts
among (few) strongly interacting ("hot"), as a rule hydrophobic, amino
acids. These contacts also stabilize local elementary structures (LES;
hidden, incipient secondary structures like α-helices and β-sheets)
which are formed early in the folding process and lead to the
postcritical folding nucleus (i.e., the minimum set of native contacts
which brings the system past the highest free-energy barrier found in
the whole folding process). Using this insight, it is possible to work
out a successful strategy for reading the native structure of designed
proteins from the knowledge of only their amino acid sequence and of
the contact energies among the amino acids. Because LES have undergone
millions of years of evolution to selectively dock to their
complementary structures, small peptides made out of the same amino
acids as the LES are expected to selectively attach to the newly
expressed (unfolded) protein and inhibit its folding, or to the
(fluctuating) native conformation and denature it. These peptides, or
their mimetic molecules, can thus be used as effective non-conventional
drugs, alternative to those already existing (and directed at
neutralizing the active site of enzymes), displaying the advantage of
not suffering from the emergence of resistance.

1 Introduction

The sequencing of the human genome [1, 2], that is, the identification
of the order in which the thousands of millions of bases follow each
other in human DNA, provides information on the sequence of amino acids
forming each of the tens of thousands of proteins which build our
cells and catalyze the chemical reactions which make them function.
Precious as this knowledge is, in terms of gene identification and
thus eventually for the development of therapies against inherited
diseases, this sequencing will find its real meaning when the linear
sequence of the bases can be set in relation with the
three-dimensional, biologically active, native structure of the protein
it codes for.

In a very real sense, the secret of life, if it exists, is to be found
at this level of physical and chemical organization, in particular in
the tight correspondence existing between the one-dimensional (1D)
structure of a protein (its sequence of amino acids) and the native
structure onto which it folds (3D structure, cf. Fig. 1) once produced
by the ribosome using the DNA (or better the mRNA) blueprint, in
typical times which range from microseconds to seconds. This is the
protein folding problem, one of the great unsolved problems of science
[2]. An even more elusive goal is the prediction of the catalytic
activity of an enzyme from its amino acid sequence.

There are two reasons why the protein folding problem is so important.
First, DNA sequencing is relatively quick, and vast quantities of
amino acid sequence data have become available through international
efforts. The acquisition of three-dimensional data is still slow and
is limited to proteins that either crystallize in a suitable form or
are sufficiently small to be solved by NMR in solution [3]. Algorithms
are thus required to translate the linear information into spatial
information. Second, one is now able to synthesize novel proteins by
way of their genes, and so the production of new enzymes with
specified catalytic activities is a challenging prospect. Producing
such new enzymes requires at least three underpinning and interrelated
abilities: (1) the ability to design a novel fold of the enzyme, or to
predict the most stable and kinetically accessible conformation of an
already existing protein if one is going to use a wild-type
conformation as a template; (2) the ability to design on this fold a
binding site for the substrate; and (3) the ability to build the
catalytic site, which is ultimately responsible for the biological
activity of the enzyme. Each of these three prerequisites is beyond
the current capabilities of theory.

To appreciate some of the difficulties facing theoreticians, we 
must consider the physical nature of protein folding. A denatured 
protein makes many interactions with solvent water. As the protein 
folds up, it exchanges those noncovalent interactions with others 
that it makes within itself: its hydrophobic side chains (which cannot
form hydrogen bonds with water) tend to pack with one another, and
many of its hydrogen-bond donors and acceptors pair with each other,
especially those in the polypeptide backbone that form the
hydrogen-bonded networks in helices and sheets. Each interaction
energy is small, but because of their large numbers, the
total free energies in the native and denatured states are large, being 
some thousands of kilocalories per mole, depending on the size of 
the protein. Yet proteins are only marginally stable, their free ener- 
gies of unfolding ranging from 5 to 15 kcal/mol. This tiny amount 
of energy is the difference between the free energies of the protein 
in its native and in its denatured states, and these states are al- 
most balanced in energy. Thus, whether a protein folds depends 
crucially on a balance between two large numbers, each of which is 
very difficult to calculate with precision. To predict the stability of 
a protein, we have to calculate not only the interaction energy be- 
tween any two atoms within a protein, but also this energy relative 
to the interactions that the individual atoms make with water in the 
denatured state and the entropic cost of folding, which represents 
the largest contribution to the free energy of the unfolded state.
Current potential functions are not sufficiently accurate for this
purpose. But we can use protein engineering (that is, the introduction
of point mutations in the amino acid sequence) and other experimental
procedures to make changes in existing proteins. Protein engineering
experiments have provided a practical route to determining
quantitatively the factors that govern stability.
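
To give a feeling for what an unfolding free energy of a few kcal/mol means, the following minimal sketch converts it into the equilibrium population of the native state. It assumes a strict two-state (native/denatured) picture, in which the ratio of the two populations is exp(ΔG/RT); the temperature of 298 K and the function name `native_fraction` are illustrative choices, not quantities taken from the text.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)

def native_fraction(dg_unfold_kcal, temperature_k=298.0):
    """Population of the native state for a two-state protein.

    dg_unfold_kcal is the free energy of unfolding (G_denatured - G_native)
    in kcal/mol; positive values favour the native state.
    """
    k_fold = math.exp(dg_unfold_kcal / (R_KCAL * temperature_k))
    return k_fold / (1.0 + k_fold)

# Scan the marginal-stability range quoted above (5 to 15 kcal/mol).
for dg in (0.0, 5.0, 10.0, 15.0):
    print(f"dG = {dg:5.1f} kcal/mol -> native fraction = {native_fraction(dg):.8f}")
```

Because the dependence is exponential, an error of a few kcal/mol in either of the two large free energies is enough to turn a predicted stable protein into a predicted unstable one, which is the difficulty described above.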

It is unlikely that there is a single mechanism for the folding of 
all proteins. As biologists know so well, proteins vary so much in 
structure, size and properties that there are bound to be a number 
of important exceptions to any general scenario 0. Further, evo- 
lution of a specific function may be at the expense of stability or 
optimization of folding rate. In any case, the basic folding mechanisms
of small, globular, single-domain proteins all point to a single
scheme, variations of which could describe a large number of folding
pathways.

Solving how a protein folds up from its denatured state to its
native structure poses an intellectual challenge that is far more com- 
plex than solving classical chemical mechanisms. In simple chemical 
mechanisms, there are usually changes in just a small number of 
bonds as a reaction progresses. Chemical bonds are often strong, 
and so stable covalent intermediates can sometimes be isolated and 
characterized. Often, the rules for analyzing mechanisms can be ap- 
plied simply and rigorously. In protein folding, on the other hand, 
the whole molecule changes in structure. Thousands of weak non- 
covalent interactions are made or broken, and it is very difficult to 
trap intermediates because of their unstable nature. An astronomical
number of conformations must be considered. In particular, the
denatured state is difficult to analyze because it is an ensemble of
many ill-defined, rapidly fluctuating conformations. But the chemist's
basic strategy for analyzing the pathways of protein folding must
still be the same as that for analyzing a simple chemical mechanism:
characterize all the stable and metastable states on (and off) the
reaction pathways and the transition states that link them. This
knowledge will provide the basis for calculations and simulations, as
well as for developing simple, but not oversimplified, models.

There is an arsenal of physical theoretical methods for the analysis
of folding, ranging from precise atomic analysis by molecular dynamics
simulations, through models involving simplified polymer chains, to
completely abstract procedures [3]. The most precise procedure,
molecular dynamics simulation, applies readily to unfolding and can,
in principle, take a (small) protein to the denatured state in
solution. Molecular dynamics simulations have not, as yet, been
applied to folding starting from the plethora of possible
conformations. In comparison, the more abstract procedures allow
simple models for folding to be analyzed directly, and they start from
a state prior to the partly collapsed state in solution. These
procedures tend to stress general principles. All the methods can give
insights into the principles of folding that can be tested against
experiment, make predictions, and fill in the entire energy landscape.

Understanding the folding of proteins is also likely to be momentous
for the development of both conventional (Hj and non-conventional
drugs [7]. Conventional drugs work by inhibiting the enzymatic
activity of specific proteins by capping their active site.
Pharmaceutical companies search for these drugs by a simple
trial-and-error method, testing hundreds of thousands of molecules on
the enzymatic target and selecting those which display the best
inhibitory ability. Only recently have attempts been made to go beyond
this brute-force method, by calculating theoretically the affinity of
the molecule for the active site of the enzyme [H]. Simple models
which employ statistically derived potentials have proved useful to
perform such calculations in simple cases, but fail badly when the
quantum mechanical properties of the molecules come into play, such as
in the case of metallo-proteins.

The use of ab initio methods such as density functional
theory (DFT) to treat the properties of the "problematic" atoms,
combined with a classical description of the surrounding molecules, 
should allow one to accurately calculate the binding energy of the 
ligand to the enzyme also for systems of realistic size (cf. e.g. [S] 
and refs. therein). 

A second, and more ambitious goal of the project is to design 
non-conventional drugs which, instead of inhibiting the activity of 
the selected proteins, destabilize them by binding to the folding
nucleus [7], making them prone to proteolysis, that is, to degradation,
usually by hydrolysis at one or more of their peptide bonds. This
approach is expected to be particularly interesting in the develop- 
ment of drugs against viral diseases, where it has been observed 
that essential proteins involved in the replication of the virus es- 
cape conventional drugs by inducing mutations in their active sites. 
This road is not open to the virus to escape the action of non- 
conventional drugs since a mutation of an amino acid participating 
in the folding nucleus will lead to the denaturation of the protein. 

The above simple considerations testify to the fact that a combined
attack on the protein folding problem and on drug design, through the
interdisciplinary efforts of biologists, chemists and physicists, has
a good chance of being successful. The rewards of such a success, in
terms of basic research as well as of practical spin-offs, are likely
to be large. In what follows we present the point of view of the
physicist, placing special emphasis on minimal models and indicating
how to extend the results to real proteins with the help of
first-principles all-atom quantal calculations.



2 Main experimental facts 

The experiments which opened the way to the study of the protein
folding problem were those performed by Anfinsen [10]. He studied the
equilibrium conformation of small proteins like Ribonuclease A and
Staphylococcal nuclease, cyclically changing the conditions of the
solvent of the protein solution (pH, temperature, etc.). He observed
that, independently of the past history of the solution, when this is
driven back to physiological conditions, a protein (characterized by a
given amino acid sequence) always folds to the same equilibrium
conformation. This result proves that the information about the unique
equilibrium conformation of a protein^1 and about the pathway to reach
it is completely encoded in the amino acid sequence.

Another feature which is common to most proteins is that their
folding, as well as their denaturation, is a highly cooperative
process. The degree of folding of proteins (i.e., the relative
population of the native state) in solution can be roughly assessed by
circular dichroism experiments, which measure the rotation of the
polarization of an incident laser beam induced by the protein. Since
the native states of proteins are usually rich in motifs like
α-helices and β-sheets, and such motifs are able to rotate the
polarization of an incident light beam, it is possible to measure the
degree of formation of the motifs and approximate this with the degree
of folding. Most proteins display a sudden transition in the relative
population of the native state as the parameters of the system (e.g.,
temperature, pH of the solution, etc.) are varied away from or towards
the biological conditions. This result indicates that some cooperative
mechanism, which involves all the parts of the protein, is acting to
stabilize the native state. This behaviour is similar to that of a
physical system undergoing a first-order phase transition [12].

^1 The uniqueness holds at the length scale of the overall structure of
the protein. Experiments by Frauenfelder [11] showed the existence of
"conformational substates", that is, fine-scale rearrangements of the
native conformation. The energy scale at which these substates become
relevant is ~ 15 meV (i.e. they are observable at ~ 200 K), and
consequently we will not take them into account.

The average folding time of proteins, that is, the time needed to
reach the native state starting from a random conformation, usually
ranges from microseconds to seconds. A few exceptions take of the
order of minutes, because of a lengthy, although not very common,
structural change taking place inside one of the twenty types of amino
acids (namely, proline). The distribution of folding times is measured
by stopped-flow, rapid quenching or flash photolysis methods [Sj,
where one prepares the protein solution under denaturing conditions,
instantaneously reverts the conditions of the solvent in order to
start the folding process and then measures the degree of folding
through optical techniques (e.g., circular dichroism). The resulting
distribution of folding times is usually a single- or
multi-exponential function (in the sense that the concentration of
folded proteins in the solution grows as a function of time with a
single- or multi-exponential behaviour). This indicates that a
Poissonian process, or a chain of (usually few) Poissonian processes,
is at the basis of the folding mechanism.

As a rule, proteins are very tolerant to point mutations. Muta- 
tions in a large number of sites have little or no effect on the folding 
properties of the protein. For example, mutations in 61% of the sites 
of Protein G cause an increase in the stabilization free energy of
less than 20% ^5. On the other hand, each protein displays some 
key sites which, if mutated, lead to a large destabilization of the 
native state. 



3 Evolutionary thermodynamics 

The number of proteins of known sequence and native conformation is of
the order of tens of thousands. Nonetheless, the number of different
topologies that the native conformations can assume is restricted to
some hundreds. Since such sequences are the result of millions of
years of evolution, and at each step of evolution all proteins must be
good folders or risk being selected out, the comparative analysis of
sequences encoding similar native structures can reveal information
about their folding.

Naively speaking, if one compares all sequences folding to simi- 
lar conformations and measures to which extent the type of amino 
acid is conserved in each site of the protein throughout all these 
sequences, it is possible to gain some insight into the role that the 
different sites play in the folding of that class of proteins. If a given
site is always occupied by the same kind of amino acid, it is likely
that that site plays an important role in the folding process. 

In order to make this problem more quantitative, one can consider the
space of sequences and study the subspace associated with a particular
native conformation (cf. Fig. 2). A useful order parameter to study
this space can be defined as the similarity between pairs of sequences

$$ q_{seq}^{\alpha\beta} = \frac{1}{N} \sum_{i=1}^{N} \delta(\sigma_i^{\alpha}, \sigma_i^{\beta}), \qquad (1) $$

where the Kronecker delta is equal to 1 if $\sigma_i^{\alpha} =
\sigma_i^{\beta}$ and zero otherwise. Here $\alpha$ and $\beta$ label
two sequences, $\sigma_i^{\alpha}$ indicates the type of amino acid
occupying site $i$ of sequence $\alpha$, and $N$ is the length of the
sequence. The typical distribution $p(q_{seq})$ of the order parameter
displays two peaks, one around $q_{seq} \approx 1$, corresponding to
sequences which differ in only a few amino acids, and the other around
$q_{seq} \approx 0.1$, corresponding to sequences whose similarity is
comparable to that of random sequences [13]. The quantity defined above
is similar to Parisi's replica order parameter for spin glasses, which
are the paradigm of physical systems controlled by a disordered,
complicated interaction (for details see ref. [E]).
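
The similarity order parameter defined above is simply the fraction of sites at which two sequences carry the same amino acid. A minimal sketch follows, assuming that the two sequences have already been aligned and have equal length; the function name and the toy sequences are illustrative and not taken from the text.

```python
def sequence_similarity(seq_a, seq_b):
    """Fraction of sites at which two equal-length sequences carry the same amino acid."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / len(seq_a)

# Hypothetical one-letter-code sequences, used only to exercise the function.
print(sequence_similarity("MKVLA", "MKVLA"))  # identical sequences
print(sequence_similarity("MKVLA", "MRVIA"))  # three of five sites conserved
```

For twenty equiprobable amino acids, two unrelated sequences would agree at about 1/20 of the sites; the low-similarity peak quoted above lies somewhat higher, as expected if real amino acid frequencies are not uniform.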

We will call "homologous" the pairs of sequences which belong to the
large-$q_{seq}$ peak and "analogous" those belonging to the
small-$q_{seq}$ peak. In order to study the conservation of protein
sites, we will make use of analogous sequences only. The reason is
that these comply better with the hypothesis of statistical
independence. Since homologous sequences are clearly strongly
correlated, the results of their statistical analysis suffer from
errors due to the particular choice of the pairs of proteins (e.g.,
the hemoglobins of some animals belong to a class of homologous
proteins which, for historical reasons, has been extensively studied).
Consequently, these proteins are not statistically independent, and
the conservation patterns arising from their study may be biased.
Furthermore, the conservation pattern of homologs is likely to reflect
the same functional requirements, like the ability to bind one given
kind of molecule (e.g., oxygen in the case of hemoglobin). While the
conservation patterns of analogous families of proteins are likely to
be also affected by functional requirements, the variety of binding
requirements is expected to be less restrictive, thus leading to
weaker correlations in the sequence ensembles.

For each class of analogous proteins it is possible to define the
entropy per site, in the form

$$ S(i) = -\sum_{\sigma=1}^{20} p_i(\sigma) \ln p_i(\sigma), \qquad (2) $$

where $p_i(\sigma)$ is the probability of finding an amino acid of kind
$\sigma$ in site $i$. The probability $p_i(\sigma)$ is normalized in
such a way that, for each site, $\sum_{\sigma} p_i(\sigma) = 1$. This
quantity can be simply calculated by counting the different types of
amino acids which occupy each site in the set of analogous sequences.
The site entropy indicates the degree of conservation of a site: the
lower the entropy, the more conserved the site is. The largest value
which the entropy $S(i)$ can assume is $-\ln(1/20) \approx 3$,
corresponding to the case of a uniform distribution of the twenty
kinds of amino acids.
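
The counting procedure just described can be sketched in a few lines. The example assumes a set of already aligned sequences of equal length, given as strings in one-letter code; the function name and the toy family are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter

def site_entropies(sequences):
    """Entropy (natural logarithm) of the amino acid distribution at each site."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    entropies = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)  # occurrences of each amino acid at site i
        total = sum(counts.values())
        entropies.append(sum(-(c / total) * math.log(c / total) for c in counts.values()))
    return entropies

toy_family = ["ACDE", "ACDF", "AGDG", "AHDH"]  # hypothetical aligned sequences
for site, entropy in enumerate(site_entropies(toy_family)):
    print(site, round(entropy, 3))
```

In this toy family the first and third sites are fully conserved and have zero entropy, while the last site, where every sequence carries a different amino acid, has the largest entropy.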

The analysis of the site entropy for a number of proteins [TB] 
shows that each class displays a small number of low-entropy (i.e., 
highly conserved) sites. An example is given in Fig. 3. The observation
that the different sequences which build a class have different
biological tasks and different active sites (i.e., the sites on the
surface which are responsible for the enzymatic activity of the
protein) rules out the possibility that conservation is solely
associated with the biological function of the protein. In ref. [16]
the authors conclude that the conserved sites are those which control
the folding process of the proteins.

4 Simplified models for protein folding 

A number of models to describe the folding process at various levels
of approximation have been proposed (cf. e.g. [S] and refs. therein). 
In what follows we briefly review some of them. 

4.1 Chemical-reaction model

The simplest model used to describe the folding of proteins consists
in summarizing all possible conformations into a few macroscopic
states, in the same way as is done to describe chemical reactions (see
e.g. ref. [3]). Usually the thermodynamically relevant states are the
unfolded state ("U"), which contains all the conformations where the
protein chain is unstructured, the native state ("N"), which
corresponds by definition to a single conformation^2, and possibly
some intermediate states ("I1", "I2", ...). The folding reaction is
thus described by a chain of events like $U \to I_1 \to I_2 \to \dots
\to N$ leading the protein from the unfolded to the native
conformation (and, in some cases, being trapped aside in dead ends).
The different states are considered to be separated by free energy
barriers. The definition of the model is then completed by assigning
to each state and to the top of each barrier (usually called
"transition state") a numeric value of the free energy.

The kinetics of a protein is then determined by Kramers' relation
[17], which is a stochastic equation controlling the crossing of
energy barriers in terms of the one-dimensional energetic profile of
the folding pathway [18]. For the simplest case of a two-state folding
pathway ($U \to N$), the distribution of folding times which solves
Kramers' equation is a single exponential, whose characteristic time
$\tau$ can be approximated by

$$ \tau = k \exp\left(\frac{\Delta F_{U-B}}{T}\right). \qquad (3) $$

The prefactor $k$ depends on the curvature of the energy profile and
on the viscosity of the solution. The quantity $\Delta F_{U-B}$ is the
difference between the free energy at the top of the barrier ("B") and
the free energy of the unfolded state, while $T$ is the temperature in
energy units. If intermediates are present in the folding pathway, the
distribution of folding times is a sum of exponentials, each of them
characterized by a characteristic time satisfying Eq. (3), where
$\Delta F_{U-B}$ is substituted by the free energy difference
associated with the relative barrier.
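
As an illustration of how strongly the folding time depends on the barrier, the sketch below evaluates the Arrhenius-like expression described above, a prefactor multiplied by the exponential of the barrier height over the temperature, together with the single-exponential growth of the folded population. The function names and all numerical values (prefactor, barrier, temperature) are hypothetical, chosen only to give times in the experimentally observed range.

```python
import math

def folding_time(prefactor, barrier, temperature):
    """Characteristic time tau = k * exp(dF / T), with T in the same energy units as dF."""
    return prefactor * math.exp(barrier / temperature)

def folded_fraction(t, tau):
    """Fraction of folded chains at time t for single-exponential (two-state) kinetics."""
    return 1.0 - math.exp(-t / tau)

# Hypothetical values: prefactor in seconds, barrier and temperature in kcal/mol.
tau = folding_time(prefactor=1e-6, barrier=5.0, temperature=0.6)
print(f"tau = {tau:.3e} s")
print(f"folded fraction after one tau = {folded_fraction(tau, tau):.3f}")
```

Raising the barrier by T ln 10 multiplies the folding time by ten, which is why folding times spanning microseconds to seconds correspond to a fairly narrow range of barrier heights.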

Naturally, this model accounts for the cooperativity of the folding 
transition. It also describes the single- and multi-exponential distri- 
bution of folding times (see Section 2). It can be used in connection 
with empirical free energy functions to predict the effect of muta- 
tions on the stability and kinetics of a given protein ^\. On the 
other hand, it provides little insight into the molecular mechanism 
of the folding process. Moreover, the model describes the folding
mechanism as a one-dimensional process, a simplification which is
likely not to be very realistic.

^2 With the caveats discussed in Footnote 1.

4.2 Entropy-based models

The so-called "new view" (cf. e.g. [20]) aims at understanding
protein folding through a careful description of the energy landscape
associated with the conformational space. In this class of models the 
protein chain is described at some degree of realism and one tries to 
characterize energetically, usually through computer simulations, a 
large number of conformational states along and around the folding 
pathway. 

A simple model which follows this point of view is the Go model [21].
In it, each amino acid of the protein can be described in terms of a
full-atom representation at one extreme of detail, or as a
structureless spherical bead at the other extreme, depending on the
computational cost one is prepared to afford. The potential function
is a sum of two-body contact terms. Each term assumes the value $-1$
if the contact in question is present in the native conformation
(which is assumed known) and zero otherwise. It reads

$$ U(\{r_i\}) = -\sum_{i<j} \Delta(r_i - r_j)\, \Delta(r_i^N - r_j^N), \qquad (4) $$

where $\{r_i\}$ denotes the Cartesian coordinates of the atoms or of
the amino acids, depending on the kind of description chosen,
$\{r_i^N\}$ being the coordinates of the native conformation, while
$\Delta(r_i - r_j)$ is a contact function which assumes the value 1 if
the $i$th and $j$th atom (or amino acid) are closer than a contact
threshold distance, $-\infty$ if they are closer than a hard-core
threshold distance, and zero otherwise. In other words, the potential
function records how many native contacts there are in a given
conformation, giving an energy $-1$ to each of them.
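
The counting of native contacts can be illustrated with a minimal sketch, in which each amino acid is a bead with given coordinates. Several simplifications are assumptions of this example rather than part of the model as stated above: the hard-core term is omitted, beads adjacent along the chain are not counted as contacts, and the function names and the four-bead conformations are hypothetical.

```python
import itertools
import math

def contacts(coords, threshold=1.0):
    """Set of pairs (i, j), j > i + 1, whose distance is within the contact threshold."""
    pairs = set()
    for i, j in itertools.combinations(range(len(coords)), 2):
        if j - i > 1 and math.dist(coords[i], coords[j]) <= threshold + 1e-9:
            pairs.add((i, j))
    return pairs

def go_energy(coords, native_coords, threshold=1.0):
    """Go-type energy: -1 for every contact that is also present in the native conformation."""
    return -len(contacts(coords, threshold) & contacts(native_coords, threshold))

# Hypothetical 4-bead chain on a square lattice: a compact native shape and a straight chain.
native = [(0, 0), (1, 0), (1, 1), (0, 1)]
extended = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(go_energy(native, native), go_energy(extended, native))
```

By construction the native conformation has the lowest possible energy, since no other conformation can contain more native contacts than the native one itself.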

According to this model, the native state is, by definition, the
ground state of the system and is also unique. Conformational sampling
performed, for example, through Monte Carlo algorithms [22] shows a
cooperative folding transition of the kind observed experimentally
(cf. Section 2). With some computational effort it has been possible
to characterize the transition state, that is, the state associated
with the top of the main free energy barrier which separates the
native from the unfolded state^3. For many small proteins, it has been
found j2Sl I21j that this state, for each protein, is characterized by
a small number of well-defined contacts. Thus, the protein does not
fold through a random collapse of the chain, but follows a
well-defined sequence of steps.

The basic idea behind the Go model is that the geometry of 
the protein is the major determinant of its folding mechanism and, 
consequently, one can approximate the free energy associated with 
the conformational states through its entropic part. In this way one 
neglects the complexity of the interaction among the twenty kinds of
amino acids and the possibility of building non-native interactions.
Accordingly, the sequence of amino acids plays no role and there are
no metastable states which can trap the chain. Notwithstanding all 
these caveats, the Go model has been important in emphasising the 
presence of a well-defined sequence of molecular events along the 
folding pathway. 

4.3 Energy-based models

A different approach to the protein folding problem focuses its
attention on the energetic content of the free energy function, in
particular on the heterogeneity of the interactions arising from the
presence of twenty different kinds of amino acids. It is known that
physical systems displaying such heterogeneity are associated, as a rule,
with a rough energy landscape with many competing low-energy 
states ^^. This is a picture incompatible with that of proteins, 
which must display a unique ground state, well separated from the 
others, and as few metastable states as possible. The purpose of 
these models is to understand what makes a protein, characterized 
by a well defined amino acid sequence, different from a generic het- 
erogeneous system, whose paradigm is found in a random sequence 
of amino acids. 

^3 In other words, the transition state describes the set of
conformations for which the probabilities to fold and to unfold
coincide and are equal to 1/2.

An important ingredient of this kind of model is the potential
function. The simplest choice is that of a contact potential of the
form

$$ U(\{r_i\},\{\sigma_i\}) = \sum_{i<j} B_{\sigma_i \sigma_j}\, \Delta(r_i - r_j), \qquad (5) $$

where $r_i$ and $\sigma_i$ are the position and kind of the $i$th
amino acid, $\Delta(r_i - r_j)$ is the contact function defined in
connection with Eq. (4) and $B_{\sigma\tau}$ is the element of the
20x20 interaction matrix which defines the interaction energy between
amino acids of kind $\sigma$ and $\tau$. A widely used interaction
matrix has been determined by Miyazawa and Jernigan [23 from the
statistical analysis of the contacts in a large database of known
proteins, assuming that the frequency with which a given contact
appears in the database measures the strength of the contact energy
between the corresponding amino acids (cf. Table 1). This is done by
calculating the probability $p_{\sigma\tau}$ of appearance of the
contact between the amino acids of kind $\sigma$ and $\tau$, and
assuming a Boltzmann-like relationship of the kind $B_{\sigma\tau}
\propto -\log p_{\sigma\tau}$.
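
A contact potential of this kind can be evaluated by summing one matrix element for every pair of amino acids in contact. The sketch below assumes a bead-per-residue description, omits the hard-core term and excludes beads adjacent along the chain; the two-letter alphabet and its energies are hypothetical stand-ins for the 20x20 matrix and are not the Miyazawa-Jernigan values.

```python
import itertools
import math

def contact_energy(coords, sequence, interaction, threshold=1.0):
    """Sum of interaction[type_i, type_j] over non-bonded pairs within the contact threshold."""
    energy = 0.0
    for i, j in itertools.combinations(range(len(coords)), 2):
        if j - i > 1 and math.dist(coords[i], coords[j]) <= threshold + 1e-9:
            pair = tuple(sorted((sequence[i], sequence[j])))  # the matrix is symmetric
            energy += interaction[pair]
    return energy

# Hypothetical two-letter alphabet (H = hydrophobic, P = polar) with illustrative energies.
toy_matrix = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): 0.2}
square = [(0, 0), (1, 0), (1, 1), (0, 1)]  # compact 4-bead conformation, one contact (0, 3)
print(contact_energy(square, "HPPH", toy_matrix))
print(contact_energy(square, "PHHP", toy_matrix))
```

Unlike in the Go model, the same conformation now has a different energy for different sequences, which is what allows one to ask how a protein sequence differs from a random one.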

The starting point of energy-based models is the study of the 
thermodynamics of a random sequence, that is, of a chain of lineraly 
connected, random chosen monomers. The simplest model used 
to describe the thermodynamical properties of such sequences is 
the random energy model (REM) [211 • In this model, the energy of 
each bond is assumed to be random, and not correlated with the 
energy of the other bonds. For a configuration with M contacts, the 
conformational energy is 

E = Sf e., (6) 

where e, is the energy associated with the zth contact of the protein 
and is, by assuption, a value picked at random from the interaction 
matrix B. Since we are dealing with random, uncorrelated values, 
the probability that a given configuration has energy E is given, if 
A^ (number of beads) is large, by the central limit theorem'* 

nE) = exp{^,) (7) 

where a is the standard deviation of the interaction and where the 
distribution of contact energies is supposed to have zero mean. The 
associated number of states is n{E) = '~f^P{E) where 7^ is the total 

*The central limit theorem states that if one has a large number of experiments which 
measure some stochastic variables (i.e., a quantity whose value is a number determined by 
the outcome of an experiment) then the probability distribution of the average of all the mea- 
surements approaches a Gaussian distribution. In other words, it states that the probability 
distribution of the sum of independent stochastic variables approaches a Gaussian distribution 
as the number of variables increases. 



13 



number of conformations available to the chain with 7 ranging from 
1.8 |2Z] to 2.2 |2H] in the case of the simple cubic-lattice models 
discussed in Sect. 5, while it takes the value 4.8 for more realistic 
models^ |77]. The number M of contacts of the chain is, in average, 
equal to N'y/2 (the factor 1/2 arising in order not to count twice the 
contacts), which, for the cubic-lattice model gives N ^ M. Thus, 
one can write n{E) as 

n{E) = expi-N{^^ - In^)). (8) 

It is seen that, if E< Ec = — Na {2 In •yY^'^, the exponent in Eq. (jHl) 
is negative. As a consequence, the number of states available with 
E < Ec decreases exponentially with the length of the chain. For 
E > Ec there are, in the limit of large N, many states, i.e, 

^(^) = ^l-7- J^. (9) 

The quantity Ec can thus be viewed as a threshold energy separating 
two regimes. Eq. (jH)) expresses the fact that Ec is the minimum con- 
formational energy associated to a random heteropolymer. Within 
the REM approximation, Ec depends only on general features of the 
system (i.e. N and a and 7), and not on the details of the sequence. 
So far we have given a description of the REM based on the 
energy, that is, a microcanonical description. It is often useful to 
give up precise control over the energy and to describe the system 
in terms of the temperature 

$$ \frac{1}{T} = \left.\frac{dS}{dE}\right|_{E=\langle E\rangle}, \qquad (10) $$



which sets the average energy ⟨E⟩ but allows fluctuations about 
it, of the order ⟨E²⟩ - ⟨E⟩² = -dE/d(T^-1) (i.e., a canonical 
ensemble description). The existence of a lowest-energy state (Ec) 
can be used to define a critical temperature 

$$ T_c = \frac{\sigma}{(2\ln\gamma)^{1/2}}. \qquad (11) $$



^The number of conformations per monomer γ is equal to the effective number of nearest 
neighbours of a monomer. This number is lower than the actual number of nearest neighbours 
(which would be 6 for the cubic lattice, and 3x3 = 9 for realistic models, where 3 is the 
number of energy minima associated with each of the two torsional degrees of freedom of 
each amino acid; see [27]) if one considers only the compact, thermodynamically relevant 
conformations. 

If one decreases the temperature below Tc, the system remains frozen 
in the last available state, i.e. Ec. In other words, the quantity Tc 
sets the temperature scale of the system. 
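
The REM quantities introduced above are simple enough to evaluate directly. The following is a minimal sketch, assuming M ≈ N (cubic lattice), dropping the Gaussian prefactor of P(E), and taking S(E) = ln n(E); the function names and the parameter values (N = 36, σ = 0.3, γ = 2.2) are illustrative choices, not taken from a specific calculation in the text.

```python
import math


def rem_log_states(energy, n_monomers, sigma, gamma):
    """ln n(E) = N ln(gamma) - E^2 / (2 N sigma^2), i.e. Eq. (8) with M ~ N."""
    return n_monomers * math.log(gamma) - energy ** 2 / (2.0 * n_monomers * sigma ** 2)


def rem_threshold_energy(n_monomers, sigma, gamma):
    """E_c = -N sigma sqrt(2 ln gamma): the energy at which ln n(E) vanishes."""
    return -n_monomers * sigma * math.sqrt(2.0 * math.log(gamma))


def rem_critical_temperature(sigma, gamma):
    """T_c = sigma / sqrt(2 ln gamma), from 1/T = dS/dE evaluated at E = E_c."""
    return sigma / math.sqrt(2.0 * math.log(gamma))


# Illustrative (assumed) parameters: a 36-mer with sigma = 0.3 and gamma = 2.2.
N, SIGMA, GAMMA = 36, 0.3, 2.2
E_C = rem_threshold_energy(N, SIGMA, GAMMA)
T_C = rem_critical_temperature(SIGMA, GAMMA)
print(f"E_c = {E_C:.2f}, T_c = {T_C:.3f}")
```

With these inputs the threshold energy comes out at about -13.6, and the log-number of states changes sign exactly at Ec, which is the content of the two regimes discussed above.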

Summing up, a random sequence displays a continuum of states, 
the lowest of which is Ec. While random sequences present a unique 
ground state, this state is not well separated in energy from a large 
number of low-energy states. Shakhnovich has shown (using a replica 
approach similar to that used for spin glasses [H]) that these states 
belong to different energy valleys of a rough energy landscape, 
valleys corresponding to very different conformations and separated by 
conspicuous energy barriers (cf. Fig. ^a)) [29]. The resulting picture 
is that of an energy landscape characterized by a large number 
of competing low-energy states, and consequently displaying a 
thermodynamics very different from that of a protein (cf. Fig. IH^b)). 

The kinetic features of a random sequence are also quite different 
from those of a protein. The roughness of the energy landscape 
produces a myriad of metastable states which can trap the kinetics 
of the chain. In particular, it has been shown by Bryngelson 
and Wolynes, studying the kinetics of random sequences [20], 
that there is a temperature Tg below which the kinetics is frozen, 
that is, the chain can hardly escape from metastable states. The 
temperature Tg for the random energy model happens to be equal 
to Tc. As a consequence, in the range of temperatures where the 
low-energy states are populated, the kinetics is frozen. 

The question is then, how is it possible to find sequences display- 
ing protein-like features? Necessary conditions for these sequences 
are 1) that they display a unique, zero-entropy ground state and 
2) that the critical temperature Tf below which the ground state 
is populated is somewhat higher than the temperature Tg at which 
the kinetics is frozen, in such a way that the range of temperatures 
Tg < T < Tf is suitable for folding. 

It was shown by Shakhnovich [SI] that a sufficient condition to 
find good folders is to search for sequences whose native energy is 
well below Ec. Since the probability of finding a random sequence with 
native energy below this threshold is exceedingly low, sequences 
such that EN ≪ Ec are very likely to display a unique native state. 
Moreover, such sequences will display a folding 
temperature Tf higher than the kinetic freezing temperature Tg. 

Let us first define δ = Ec - EN as the energy gap between the 
native state and the lowest state belonging to the part of the 
spectrum described by the random energy model, which is associated with all 
the conformations structurally different from the native one. In ref. 
[22] it was shown that the condition Tf > Tg is equivalent to the 
presence of a positive gap δ. According to the very definition of 
temperature (dS/dE = 1/T), the critical temperature Tg is the inverse 
slope of the tangent to the entropy evaluated at Ec (see Fig. EI). On 
the other hand, Tf is defined by the condition FN = FU, where FN 
and FU are the free energies of the native and of the unfolded state 
(i.e., the state belonging to the part of the spectrum described by 
the random energy model and displaying a local minimum in free 
energy), respectively. Since the native state has, by definition, zero 
entropy, this condition becomes EN = EU - Tf SU. Tf is then the 
inverse slope of the straight line tangent to the parabolic line defining 
the entropy in the random energy model and having zero entropy 
at EN. From Fig. El one can conclude that the condition Tf > Tg is 
equivalent to δ > 0. 

Operationally, finding sequences with a large gap δ = Ec - EN 
(compared to the energy scale Tc of the system) can be done as 
follows [31]: select a conformation to be the native one, fix the number 
of amino acids of each kind, and minimize the energy of the 
sequence by swapping amino acids. In doing this, the energies 
of the unfolded conformations (and thus Ec) do not change, because 
they depend only on the amino acid composition, while the energy 
of the native conformation, which depends on the particular 
sequence, reaches values below Ec. This method can be implemented, 
for example, with a Metropolis Monte Carlo algorithm [22] in the 
space of sequences, performed at a low enough "selective" 
temperature Tg (i.e., the temperature defined in the space of sequences, 
which has the physical meaning of an evolutionary bias towards 
low-energy sequences). Due to the size of the space of sequences, one is 
unlikely to find the absolute energy minimum, but any 
sequence with EN < Ec will do the job. 
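
The design procedure just described can be sketched in a few lines. The following is an illustrative sketch only: it assumes that the native energy is a sum of contact energies U over the native contact list, and it uses a toy random contact matrix, a hypothetical contact list for an 8-mer, and hypothetical names (`native_energy`, `design_sequence`, `t_sel`), none of which come from the text. Swapping two monomers keeps the composition fixed, as required.

```python
import math
import random


def native_energy(sequence, contacts, U):
    """Sum of contact energies over the native contacts (i, j)."""
    return sum(U[sequence[i]][sequence[j]] for i, j in contacts)


def design_sequence(sequence, contacts, U, t_sel, n_steps, rng):
    """Metropolis search in sequence space at selective temperature t_sel.

    Each move swaps two monomers, so the amino acid composition is conserved.
    Returns the lowest-energy sequence visited and its native energy.
    """
    seq = list(sequence)
    energy = native_energy(seq, contacts, U)
    best_seq, best_energy = list(seq), energy
    for _ in range(n_steps):
        a, b = rng.sample(range(len(seq)), 2)
        if seq[a] == seq[b]:
            continue  # swapping identical types changes nothing
        seq[a], seq[b] = seq[b], seq[a]
        new_energy = native_energy(seq, contacts, U)
        delta = new_energy - energy
        if delta <= 0.0 or rng.random() < math.exp(-delta / t_sel):
            energy = new_energy  # accept the swap
            if energy < best_energy:
                best_seq, best_energy = list(seq), energy
        else:
            seq[a], seq[b] = seq[b], seq[a]  # reject: undo the swap
    return best_seq, best_energy


# Toy inputs (assumptions): 4 amino acid types, Gaussian contact energies with
# standard deviation 0.3, and a made-up native contact list for an 8-mer.
rng = random.Random(0)
N_TYPES = 4
U = [[0.0] * N_TYPES for _ in range(N_TYPES)]
for a in range(N_TYPES):
    for b in range(a, N_TYPES):
        U[a][b] = U[b][a] = rng.gauss(0.0, 0.3)
INITIAL = [0, 0, 1, 1, 2, 2, 3, 3]
CONTACTS = [(0, 3), (0, 5), (1, 4), (2, 5), (2, 7), (4, 7)]

designed, e_designed = design_sequence(INITIAL, CONTACTS, U, t_sel=0.05, n_steps=2000, rng=rng)
print(designed, round(e_designed, 3))
```

In a real application the designed energy would then be compared with Ec to check that the gap is positive.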

Note that the design of good folders does not solve the 
protein folding problem, but the so-called inverse folding problem, 
namely: given a target conformation, find the sequences which 
display this conformation as their native state (i.e., non-degenerate, stable 
and, as we shall discuss in Sect. 5, kinetically accessible). Nonetheless, 
the systematic study of the folding of these designed sequences 
has opened the way to a bona fide solution of the protein folding 
problem, albeit still within the framework of minimal models of 
proteins (cf. Sect. 5). In any case, its extension to real proteins looks 
technically possible, provided that a reliable function to describe the 
potentials among amino acids is available. 

4.4 Molecular dynamics models 

A research tool whose importance has grown with the increase 
in available computer power is molecular dynamics simulation, 
where the protein is described in atomic detail and folding 
trajectories are generated making use of molecular dynamics 
algorithms. One way to carry out such a program consists in using 
an explicit description of the solvent molecules and integrating 
Newton's equations of motion. An alternative method, which saves 
some computational time, is to use Langevin's equations [HB], 
taking into account the solvent in an implicit way. 
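
To make the implicit-solvent idea concrete, here is a minimal sketch of a Langevin integrator for a single particle in one dimension, in which the solvent enters only through a friction term and a random force. It is a toy illustration under stated assumptions (a simple Euler-Maruyama scheme, a harmonic force, arbitrary units, hypothetical function and parameter names), not the scheme of any particular molecular dynamics package.

```python
import math
import random


def langevin_trajectory(x0, v0, force, mass, friction, kT, dt, n_steps, rng):
    """Integrate m dv = F(x) dt - friction*m*v dt + sqrt(2*friction*m*kT) dW.

    The friction and noise terms stand in for the solvent; with kT = 0 the
    dynamics reduces to deterministic damped motion.
    """
    x, v = x0, v0
    noise_amp = math.sqrt(2.0 * friction * kT * dt / mass)
    for _ in range(n_steps):
        v += (force(x) / mass - friction * v) * dt + noise_amp * rng.gauss(0.0, 1.0)
        x += v * dt
    return x, v


def harmonic_force(x):
    """Force of the toy potential U(x) = x^2 / 2."""
    return -x


# Thermal run in the toy harmonic well (all parameter values are assumptions).
x_end, v_end = langevin_trajectory(1.0, 0.0, harmonic_force, mass=1.0, friction=1.0,
                                   kT=0.6, dt=0.01, n_steps=1000, rng=random.Random(1))
print(x_end, v_end)
```

Setting kT to zero is a convenient sanity check: the particle must then relax to the bottom of the well.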

There is a wide choice of realistic potential functions which can 
be used in connection with molecular dynamics simulations. Most 
of them, like GROMACS |MI, AMBER [35] and CHARMM ^, 
are obtained from chemical calculations of simpler molecules. The 
accuracy of these functions in describing the actual potentials among 
amino acids is controversial. 

The fundamental drawback of this kind of calculation is 
that, due to its complexity, it is only possible to simulate, 
within a reasonable amount of CPU time, a few trajectories lasting 
for a tiny fraction of the overall folding time. For example, the 
most ambitious simulation performed to date consists of a single 
trajectory of 1 μs of one of the smallest known proteins, namely 
villin ISII. 

Although it is not possible to carry out a full folding trajectory, 
nor to collect meaningful statistics over different molecular events, 
one can still simulate the unfolding of protein chains, starting from 
the native conformation. The reason for this is that unfolding 
simulations can be performed at high temperature, thus decreasing the 
reaction times considerably. Calculations of this type jSH] 
(using lattice models) have shown that the amino acids which are 
important for the unfolding mechanism are the same as those which control 
the folding. 

An example is the study of the unfolding of Chymotrypsin 
Inhibitor 2 ^^, which highlights that at the transition state (i.e. the 
state at the top of the free energy barrier between the native and the 
unfolded states) only 25% of the native contacts are formed, and that two 
amino acids control the kinetics, namely alanine-16 and leucine-49. 
This is in agreement with the results obtained by studying the effects 
point mutations have on the kinetics of the protein |4n]. 

5 Lattice models 

A powerful tool to study the physics of the folding mechanism of 
proteins is the lattice model. It is based on two approximations. First, 
the internal atomic structure of the amino acids is neglected and 
each of them is described as pointlike. Within this approximation, 
the entropy associated with the internal degrees of freedom of each 
amino acid is neglected and the force field created by each amino 
acid is regarded as isotropic. The second approximation consists in 
locating the beads representing the amino acids on the vertices of a 
cubic lattice of unit side length. Accordingly, the conformational 
degrees of freedom are discrete. This is very convenient from a 
computational point of view and makes conformational entropy easy to 
handle. Within this approximation, the small-scale motion of 
the protein (i.e., the peptide bond vibrations) is neglected and the 
chain is constrained to have unrealistic angles between monomers 
(π/2, π and 3π/2). A more realistic choice is to use an fcc lattice 
(the average mean square difference between real proteins 
and their projection onto an fcc lattice is ~ 1 Å |1I]), although 
calculations are slightly more complicated. Since the choice of the lattice 
does not change the underlying physics, in the following we will 
restrict ourselves to the use of a cubic lattice. The potential function used in 
these calculations is that introduced in Eq. (0). 
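
As an illustration of how little machinery the lattice model needs, the sketch below checks that a set of bead positions is a valid self-avoiding chain on the cubic lattice and evaluates its energy. It assumes the usual contact form of the potential, in which a pair of monomers contributes its contact energy U when the two beads are nearest neighbours on the lattice but not consecutive along the chain; the function names and the toy two-letter contact matrix are illustrative assumptions.

```python
def are_lattice_neighbours(a, b):
    """True if two lattice sites are at unit distance on the cubic lattice."""
    return sum(abs(p - q) for p, q in zip(a, b)) == 1


def is_valid_conformation(coords):
    """Chain connectivity (consecutive beads adjacent) and self-avoidance."""
    if len(set(coords)) != len(coords):
        return False
    return all(are_lattice_neighbours(coords[k], coords[k + 1])
               for k in range(len(coords) - 1))


def conformation_energy(coords, sequence, U):
    """Sum of U[m(i)][m(j)] over pairs in contact, excluding chain neighbours."""
    energy = 0.0
    n = len(coords)
    for i in range(n):
        for j in range(i + 2, n):
            if are_lattice_neighbours(coords[i], coords[j]):
                energy += U[sequence[i]][sequence[j]]
    return energy


# Toy example (assumed values): a 4-mer folded into a square has one contact, 0-3.
U_TOY = [[-1.0, 0.5], [0.5, -0.2]]
SEQ = [0, 1, 1, 0]
SQUARE = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
STRAIGHT = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
print(conformation_energy(SQUARE, SEQ, U_TOY), conformation_energy(STRAIGHT, SEQ, U_TOY))
```

Because coordinates are integer triples, conformational enumeration and Monte Carlo moves reduce to integer bookkeeping, which is what makes the entropy easy to handle.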

The two ingredients which, notwithstanding the strong 
approximations, the model retains are the polymeric character of the 
protein (i.e., the fact that amino acids are linked into a chain) and the 
heterogeneity in the interaction, reflected by the interaction matrix 
BrjTT. These two ingredients are a source of frustration, that is, the 
impossibility for the system to satisfy all interactions at the same 
time [12]. Frustrated systems, as a rule, give rise to a rough energy 
landscape, of the kind described in Sect. 4.3 for random sequences. 
Sequences selected, making use of the algorithm also described in 
Sect. 4.3, to fold to a unique native conformation are those which 
somewhat minimize the frustration of the system [12]. 

The study of the dynamics of these designed lattice-model 
proteins has uncovered a remarkably simple, strongly hierarchical 
picture of the folding process. Simulating the dynamics of the protein 
chain by means of, for example, a dynamical Monte Carlo 
algorithm (cf. Appendix), it is found that the whole process can be 
summarized as follows (cf. Fig. ^) [HI |13]: 

1. Formation of local elementary structures (LES), built by amino 
acids close along the chain, very early in the folding process, i.e., 
in times of the order of 10~^ the folding time (cf. Fig. Efb)). 
LES are, as a rule, stabilized by strong local contacts (dotted 
lines). 

2. Docking of the LES in their native conformation (dashed lines, 
Fig. Etc)) leading to the (post-critical, in the sense that it 
corresponds to a state beyond the top of the free energy 
barrier) folding nucleus, that is, the minimal set of native contacts 
(the sum of the contacts labeled by dashed and by dotted lines, 
Fig. Efc)) which, once formed, guarantees fast folding. This 
event is interpreted as the overcoming of the main free energy 
barrier encountered by the protein in the folding process. In 
fact, after the formation of the folding nucleus the protein 
proceeds downhill, almost barrier-free, toward the native state (cf. 

m 

3. Relaxation of the remaining amino acids to form the corre- 
sponding native bonds which complete the folding process in 
times which are of the order of 10~^ the folding time (Fig. 

ETd)). 

Such a hierarchy of events provides an effective mechanism 
according to which the entropy is squeezed out of the system (see 
Fig. 7). In fact, once formed, the LES can be thought of as 
almost rigid structures. Consequently, the chain presents an effective 
length that is shorter than the actual one. Furthermore, the LES 
interact among themselves with energies which are much larger than 
those of single amino acids. Consequently, it is unlikely that LES 
form stable structures different from those for which they have been 
designed. In fact, such an event would imply the simultaneous 
optimization of a number of uncorrelated contacts, a highly unlikely 
scenario. As seen from Fig. 7, the entropy is successively reduced 
by the formation of the folding nucleus. 

5.1 The impact of mutations 

An important feature which makes protein-like sequences different 
from random heteropolymers is their behaviour with respect to 
mutations. Random heteropolymers are very sensitive to 
mutations, in the sense that a mutation usually leads to a very different 
target conformation in the compaction process. For protein-like 
sequences this is, as a rule, not the case. In fact, a protein can undergo 
many mutations without changing its native state conformation. On 
the other hand, mutations made at special sites usually lead to 
complete misfolding or to a great destabilization of the native state. 
These theoretical results agree with those found in protein 
engineering studies of real proteins jS]. To make this point clearer we 
report the result of a study of the impact of point mutations on the 
designed sequence composed of 36 monomers whose folding 
mechanism is displayed in Fig. [HI [IE]. From this study it was found that 
mutated sequences can be classified into three groups: 

1. sequences that still fold to the native state, 

2. sequences which fold to a unique compact structure, usually 
similar but not identical to the native one, 

3. sequences which do not fold to a unique conformation. 

An analysis of the resulting sequences reveals that the impact of 
a mutation depends on the local change in energy induced in 
the native state, that is, 

$$ \Delta E_{loc}[m(i) \to m'(i)] = \sum_j \left(U_{m'(i)m(j)} - U_{m(i)m(j)}\right)\Delta(|r_i - r_j|), \qquad (12) $$

where the amino acid m at position i is substituted by the amino acid 
m'. If ΔEloc is small (compared to σ) the mutation has no effect on 
the thermodynamics of the protein, but if it is large, it denatures the 
protein. According to the value of ΔEloc averaged over all nineteen 
possible mutations, the different sites of a native conformation can 
be classified as (cf. Fig. Efd)): 

1. hot: sites very sensitive to point mutations (black beads in 
Fig. El^d)), characterized by a high value of ΔEloc as compared 
to σ (σ = 0.3). If a hot site undergoes a mutation, the resulting 
sequence becomes, as a rule, denatured. 

2. warm: sites whose mutation gives a sequence which either folds 
to the same structure or to a structure which is quite similar to it, 
occupying the native state with a reduced probability as compared 
to the original sequence (grey beads in Fig. Efd)). 

3. cold: sites which are not sensitive to point mutations (white 
beads in Fig. El^d)). 

LES are stabilized by hot and warm sites. This is the reason 
why mutations of type (1) or (2) can strongly affect the 
folding ability of the sequence. 
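
The site classification can be illustrated with a short sketch that evaluates the local energy change of Eq. (12) for every possible substitution at a site and averages over them. The two thresholds used to separate hot, warm and cold sites are hypothetical parameters introduced only for this example (the text states only that the comparison is made against σ), as are the function names and the toy two-letter alphabet.

```python
def local_energy_change(site, new_type, sequence, native_contacts, U):
    """Eq. (12): change in native energy when the monomer at `site` becomes `new_type`."""
    old_type = sequence[site]
    total = 0.0
    for i, j in native_contacts:
        if i == site:
            partner = j
        elif j == site:
            partner = i
        else:
            continue
        total += U[new_type][sequence[partner]] - U[old_type][sequence[partner]]
    return total


def mean_local_energy_change(site, sequence, native_contacts, U):
    """Average of Eq. (12) over all substitutions different from the wild type."""
    others = [t for t in range(len(U)) if t != sequence[site]]
    return sum(local_energy_change(site, t, sequence, native_contacts, U)
               for t in others) / len(others)


def classify_sites(sequence, native_contacts, U, hot_threshold, warm_threshold):
    """Label each site 'hot', 'warm' or 'cold' (thresholds are assumed inputs)."""
    labels = []
    for site in range(len(sequence)):
        de = mean_local_energy_change(site, sequence, native_contacts, U)
        if de >= hot_threshold:
            labels.append("hot")
        elif de >= warm_threshold:
            labels.append("warm")
        else:
            labels.append("cold")
    return labels


# Toy example (assumed values): one strongly attractive native contact, 0-3.
U_TOY = [[-1.0, 0.0], [0.0, 0.0]]
SEQ = [0, 1, 1, 0]
NATIVE_CONTACTS = [(0, 3)]
LABELS = classify_sites(SEQ, NATIVE_CONTACTS, U_TOY, hot_threshold=0.5, warm_threshold=0.1)
print(LABELS)
```

For a twenty-letter alphabet the average in `mean_local_energy_change` runs over the nineteen possible substitutions, as in the text.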

5.2 A solution to the protein folding problem 

With the help of the results discussed above, a strategy known as the 
three step strategy (3SS) was developed, which allows one to predict 
the three-dimensional native conformation of a model protein from 
its amino acid sequence jUj, provided the contact energies acting 
among the amino acids, which were used to design the protein, 
are known. The algorithm consists of three steps, namely 1) finding 
good candidates for the role of local elementary structures, 2) finding 
good candidates for the folding nucleus and 3) finding the native 
conformation by relaxing the residues not participating in the folding 
nucleus. This algorithm is based on the hierarchical sequence of 
events that allows the chain to fold fast, and works because at each 
step only a limited portion of the configurational space has to be 
searched through (cf. Fig. E)). 

In what follows we briefly discuss the 1D→3D algorithm and 
apply it to a representative example of notional proteins. 

• Step 1: finding the LES which govern the folding process. 
Elementary structures are called "closed" or "open", depending 
on whether or not they contain interactions within themselves (other 
than the peptide bond). In fact, in some cases short 
fragments of the chain, stabilized only by the peptide bond, 
play the role of LES. Examples of closed LES are provided by 
LES 3-6, LES 11-14 and LES 27-30 (cf. Fig. El^b)), structures 
stabilized by the native contacts drawn as dotted lines. 
In keeping with this classification of LES, the present step is 
composed of two substeps. 

Substep 1a: Finding the open elementary structures. For each 
substring of the sequence, starting at monomer i and ending at 
monomer j (0 < i < j < N), we define the energy density 

$$ \epsilon_{ij} = \frac{1}{j-i+1} \sum_{l \in [i,j]} \sum_{k \notin [i,j]} U_{m(l)m(k)}, \qquad (13) $$

where U is the matrix of contact energies used to design the 
notional protein. In other words, ε_ij is the average energy with 
which each element of the substring (i,j) interacts with the rest 
of the chain. The substrings which are good candidates to 
be open elementary structures in the folding process have low 
values of ε_ij. Among such substrings we select those with values 
of ε_ij lower than a threshold ε*. 

Substep 1b: Finding the closed elementary substructures. For 
this purpose we evaluate, for each pair of monomers i and j, 
the function 

$$ p(i,j) = \frac{\exp\left(-U_{m(i)m(j)}/T_{eff}\right)}{|i-j|^{\rho}}, \qquad (14) $$

where Teff is an effective temperature which we set equal to the 
standard deviation of the interaction matrix U (e.g. σ = 0.3 
for the case of the contact matrix displayed in Table 1). This 
function has been chosen in order to maximize the attraction 
between amino acids and minimize their distance. The 
exponent ρ = 1.7 reflects the ratio between the number of 
conformations associated with the formation of a contact and 
the total number of conformations. If a substructure contains 
more than one interaction, the values of p associated with the 
different interactions are to be multiplied together. As possible 
(closed) local elementary structures, we select those composed 
of monomers i, i+1, ..., j-1, j and with p(i,j) > p*, where 
p* is a threshold value. 

• Step 2: Finding the folding nucleus. All the elementary 
structures (let S be the total number of such structures) found in 
substeps 1a and 1b are moved in space and the conformational 
spectrum is found. This is done by selecting all possible choices 
of 1, 2, ..., S local elementary structures, giving them all 
possible relative conformations and making a complete enumeration 
of their reciprocal positions in space. The conformations with 
the lowest energies are selected as possible candidates for the 
(post-critical) folding nucleus of the protein. 

• Step 3: Relaxing the remaining monomers around the folding 
core. This can be done through a complete enumeration of all 
the conformations displaying a given nucleus, as their number is 
rather low (~ 10^ for a 36mer). Another way, which we found 
computationally attractive, is to use low-temperature Monte 
Carlo relaxation simulations, keeping fixed the monomers 
belonging to the folding nucleus. ^ The (single) totally relaxed 
conformation with energy lower than Ec is the native 
conformation of the protein. 
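
The search for closed LES in Substep 1b amounts to scoring every pair of monomers and keeping the pairs above a threshold. The sketch below assumes that the score is a Boltzmann factor of the contact energy divided by the sequence separation raised to the power 1.7, and it only considers odd separations of at least three, since on a cubic lattice only such pairs can be nearest neighbours. The function names, the threshold value and the toy two-letter contact matrix are hypothetical.

```python
import math


def contact_score(i, j, sequence, U, t_eff, exponent=1.7):
    """Assumed form of the score: exp(-U[m(i)][m(j)] / t_eff) / |i - j|**exponent."""
    return math.exp(-U[sequence[i]][sequence[j]] / t_eff) / abs(i - j) ** exponent


def candidate_closed_les(sequence, U, t_eff, threshold):
    """Return (i, j, score) for pairs scoring above threshold, best first.

    Only odd separations |i - j| >= 3 are scanned, because of the parity
    constraint on contacts in the cubic lattice.
    """
    n = len(sequence)
    found = []
    for i in range(n):
        for j in range(i + 3, n, 2):
            score = contact_score(i, j, sequence, U, t_eff)
            if score > threshold:
                found.append((i, j, score))
    return sorted(found, key=lambda item: item[2], reverse=True)


# Toy example (assumed values): only the pair 0-3 interacts attractively.
U_TOY = [[-1.0, 0.0], [0.0, 0.0]]
SEQ = [0, 1, 1, 0, 1, 1, 1, 1]
CANDIDATES = candidate_closed_les(SEQ, U_TOY, t_eff=0.3, threshold=1.0)
print([(i, j) for i, j, _ in CANDIDATES])
```

The candidates returned here would then be fed to Step 2, where their relative positions in space are enumerated.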

The algorithm described above was tested with success on 
representative examples of lattice designed proteins. Below we discuss 
results concerning the designed sequence S36 (cf. Fig. E^a)), which 
folds into the native structure shown in Figs. Efd) and lH^d), 
displaying a native energy EN = -16.5, much lower than the threshold 
energy Ec = -14 (thus δ/σ = (Ec - EN)/σ ≈ 8). In Fig. l^a) 
we display the distribution of values of p(i,j). Three bonds have 
a p-value which is remarkably larger than that associated with the 
rest of the possible bonds of the protein, and consequently are good 
candidates for stabilizing closed local elementary structures. The 
distribution of values of ε_ij, displayed in Fig. E^b), shows a single 
peak, whose lowest points are associated with the same sites already 
involved in the closed elementary structures. It is thus likely that 
open elementary structures do not play any noticeable role in the 
folding process of S36. We thus search for a folding nucleus composed 
of the LES (3,4,5,6), (11,12,13,14) and (27,28,29,30), stabilized 
by the contacts 3-6, 11-14 and 27-30. A complete 
enumeration of all the conformations built out of these three elementary 

^In some cases the system is non-ergodic, in the sense that from a given starting 
configuration it is not possible to reach all other configurations (with the folding core formed 
and fixed). In such cases several relaxation simulations are performed starting from different 
conformations (with the folding core formed and fixed). In keeping with this fact, the folding 
nucleus of a notional protein could be required not to be exceedingly stable, so as to avoid 
long-lived metastable states en route to folding. 

substructures gives the energy distribution displayed in Fig. Efc). 
The most stable of these conformations has energy -7.81 and is, 
in fact, the actual folding core (cf. Fig. El^c)). The relaxation of 
the other amino acids around it gives the right native conformation, 
with energy EN = -16.50. The next low-energy conformations built 
of the three elementary substructures have energies -7.75, -7.68 and 
-7.68. The relaxation of the other residues around these 
tentative folding nuclei leads to "native" energies -12.40, -12.58 
and -14.05, respectively. The first two of them are larger than 
Ec = -14, so they correspond to states which belong to the set of 
conformations structurally dissimilar (q ≪ 1) from the native 
conformation we are searching for. The last of them has an energy just below 
Ec. Although it can hardly be confused with the native 
conformation, it corresponds to a metastable state which can slow down the 
folding process. 

6 Design of non-conventional drugs 

Drugs perform their activity by either activating or inhibiting some 
target component of the cell. In particular, many inhibitory drugs 
bind to an enzyme and suppress its function by preventing the 
binding of the substrate. This is done either by capping the active site 
of the enzyme (competitive inhibition) or by binding to some other 
part of the enzyme so as to induce a structural change which 
makes the enzyme unfit to bind the substrate (allosteric inhibition). 
The two main features that inhibitory drugs must have are efficiency 
and specificity. In fact, it is not sufficient that the drug binds to its 
target and efficiently reduces its activity; it is also important that 
it does not interfere with other cellular processes, binding only to 
the protein it was designed for. These features are usually 
accomplished by designing drugs which mimic the molecular properties 
of the natural substrate. In fact, the enzyme/substrate pair has 
undergone millions of years of evolution in order to display the 
required features, and consequently the more similar the drug is to 
the substrate, the lower the probability that it interferes with other 
cellular processes. 

Something that this kind of inhibitory drug usually cannot do 
is to avoid the development of resistance, a phenomenon which is 
typically related to viral protein targets, in particular those of 
retroviruses which, carrying their genetic information on a single RNA 
strand, display a replication mechanism prone to errors (mutations). 
Under the selective pressure of the drug, the target is often able to 
mutate the amino acids at the active site in such a way that the 
activity of the enzyme is essentially retained, while the drug is no 
longer able to bind to it. An important example of drug resistance is 
that displayed by patients affected by HIV 
(AIDS). In this case one of the target proteins, the HIV protease, 
is found to mutate its active site so as to elude the effects of 
conventional drugs within a period of 6-8 months after the start of 
the therapy. 

Making use of the insight obtained from the study of model 
protein folding and of the 3SS discussed above, the possibility emerges 
of designing drugs which interfere with the folding mechanism of the 
target protein, destabilizing it and making it prone to proteolysis'''. 
Furthermore, exploiting the hierarchical folding mechanism of 
proteins, it is possible to design drugs which are not only efficient and 
specific but which, at the same time, do not suffer from the emergence 
of resistance. 

As shown above, local elementary structures are the building 
blocks that make up the folding nucleus. To perform their job 
effectively, LES interact strongly only with their complementary 
structures, avoiding the formation of metastable states. This fact 
suggests that peptides with the same sequence as LES could be 
profitably used as drugs. These peptides would interact strongly only with 
the LES of the protein, and consequently block their assembly. To 
assess the correctness of these statements, the effect of these 
peptides (which will be referred to as p-LES) on the folding ability of 
a variety of lattice designed proteins of different lengths has been 
studied. 

The central quantity in this study is the parameter q, which 
measures the similarity between a configuration at time t and the 
native state. To each configuration {r_i} is associated a contact map 
Δ_ij = Δ(|r_i - r_j|). Given a map Δ_ij(r) and the map relative to 
''^That is, the cleavage of misfolded proteins into their constituent amino acids. Proteolysis is 
carried out by a number of enzymes which are quite ubiquitous in cells and whose function is 
to "clean up" the cell from non-functional proteins. 

the native state Δ_ij(r_N), the similarity parameter q is defined as 

$$ q = \frac{\sum_{i<j} \Delta_{ij}(r)\,\Delta_{ij}(r_N)}{\sum_{i<j} \Delta_{ij}(r_N)}. $$
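
In practice q is just the fraction of native contacts present in the current conformation. A minimal sketch, assuming conformations are given as lists of integer lattice coordinates and that two monomers are in contact when they are nearest neighbours on the lattice but not consecutive along the chain (function names are illustrative):

```python
def contact_map(coords):
    """Set of pairs (i, j), i < j, in contact: lattice neighbours, not chain neighbours."""
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + 2, n):
            if sum(abs(p - q) for p, q in zip(coords[i], coords[j])) == 1:
                contacts.add((i, j))
    return contacts


def similarity(coords, native_coords):
    """Fraction of native contacts that are also formed in `coords`."""
    native = contact_map(native_coords)
    if not native:
        raise ValueError("the native conformation has no contacts")
    return len(native & contact_map(coords)) / len(native)


# Toy example: a 4-mer whose native state is a square (one contact, 0-3).
NATIVE = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
STRETCHED = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
print(similarity(NATIVE, NATIVE), similarity(STRETCHED, NATIVE))
```

By construction q equals 1 in the native state and decreases towards 0 as native contacts are lost.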

A typical simulation involves a designed protein sequence and a 
number of shorter chains, whose length lies within the range of 2-12 
residues. The peptides used during the simulation are indicated as 
p-LES n'-m', where n' and m' are the first and the last monomer of 
the p-LES, following the numbering of the protein sequence. Their 
activity has been checked against that of non-specific short peptides 
(denoted p n'-m') built of pieces of the protein not belonging to LES, 
whose first and last amino acids are n' and m', respectively. 

At each Monte Carlo step of the simulation of the folding of the 
designed protein in the presence of np peptides, one of the np + 1 
chains (i.e., either the designed protein or one of the peptides) is 
picked with equal probability. Then a site of the selected chain 
is chosen with probability 1/L, where L is the length of the chain. 
A move of the type displayed in Fig. |2S1 (cf. Appendix) is then 
attempted. We let the system evolve for about 2 x 10® Monte Carlo 
steps, recording the value of the similarity parameter q every 1500 
steps. Making use of these values, the normalized probability 
function p(q) is constructed. The population of the native conformation 
is defined as the fraction of the probability function with q > 0.7, 
that is ∫_{0.7}^{1} p(q) dq. The value 0.7 has been chosen as it corresponds 
to the minimum in p(q) separating the peaks associated with the 
unfolded and with the folded phases of the isolated protein (cf. black 
continuous curve in Fig. irUj). 
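
Two ingredients of this protocol are easy to state in code: the choice of which chain and which site to move, and the estimate of the native population from the recorded values of q. The sketch below covers only these two pieces (the moves themselves are described in the Appendix); the function names are hypothetical, and the q values in the example are made up for illustration.

```python
import random


def pick_chain_and_site(chain_lengths, rng):
    """Pick one of the np + 1 chains uniformly, then a site with probability 1/L."""
    chain = rng.randrange(len(chain_lengths))
    site = rng.randrange(chain_lengths[chain])
    return chain, site


def native_population(q_samples, q_threshold=0.7):
    """Fraction of recorded q values above the threshold separating the two phases."""
    if not q_samples:
        raise ValueError("no samples recorded")
    return sum(1 for q in q_samples if q > q_threshold) / len(q_samples)


# Example: a 36-mer together with two 4-residue peptides (np = 2).
CHAIN_LENGTHS = [36, 4, 4]
rng = random.Random(0)
print(pick_chain_and_site(CHAIN_LENGTHS, rng))

# Made-up q values standing in for samples recorded every 1500 steps.
Q_SAMPLES = [0.1, 0.8, 0.9, 0.5]
print(native_population(Q_SAMPLES))
```

Note that with this selection rule a monomer of a short peptide is attempted more often than a monomer of the protein, since every chain is picked with the same probability regardless of its length.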

6.1 Destabilizing effects of p-LES 

We shall first discuss the p-LES strategy on the test sequence S36. 
The simulations have been performed at the folding temperature, 
at which the population of the native state is 1/2. Although this is 
quite a high temperature from the biological point of view, it 
allows one to compare, on an equal footing, results obtained studying 
different lattice designed proteins displaying different thermodynamical 
properties. In the case of S36 the folding temperature is, in the units 
we are using (RTroom = 0.6 kcal/mol), T = 0.24. 

We now study how the presence of a number np of peptides of 
different types, corresponding both to p-LES and to non p-LES (as a 
check), affects the folding of the designed protein S36. As mentioned 
above, the folding and stability of the sequence S36 is controlled by 
three LES, namely LES 3-6, LES 11-14 and LES 27-30. The 
equilibrium distribution of q for the protein S36 in the presence of a number 
np of p-LES of kind 3'-6' is displayed in Fig. ^1. The degree to which 
the protein is hindered from reaching the native state in the presence 
of the peptide 3'-6' is shown in Fig. ^2, which displays a monotonic 
decrease of the population of the native state, reaching essentially 
zero for np = 4. 

Similarly to the previous case, the equilibrium distributions of states 
of S36 in the presence of np p-LES 27'-30' and 11'-14' are displayed in 
Figs. CMni and EHISl, respectively. The effect of p-LES 27'-30' 
is stronger than that of p-LES 3'-6', totally destabilizing the 
protein (i.e., the value of ∫_{0.7}^{1} p(q) dq is essentially zero) already at 
np = 2. On the other hand, p-LES 11'-14' appears to be somewhat 
less effective than the other two p-LES in blocking the folding of S36 
(cf. Fig. [T3j). The results shown above essentially do not change 
if, instead of starting the folding simulations of S36 with a number 
np of p-LES with S36 in a denatured (elongated) conformation, the 
simulations are started with S36 in the native conformation, as was 
expected from the fact that we are studying the equilibrium 
properties of the system, and that proteins display important fluctuations 
around the native conformation. From what has been shown so far 
it is clear that the presence of one or more p-LES leads to an 
important destabilization of S36. A question which now arises is: what 
would happen if instead of peptides of type p-LES one uses peptides 
(possibly again built out of four residues) which correspond to 
segments of the designed sequence not belonging to the folding 
nucleus? To answer this question, simulations have been carried out 
in which the folding of S36 is studied in the presence of a number np 
of peptides p 8'-11' or p 30'-33'. 

A single peptide of type 8'-11' seems to slightly increase the 
stability of S36 while three destabilize it to a very small extent, both 
effects being only marginal. A similar result is found for p 30'-33' 
(cf. Fig. Uni). To check whether these results are due to the fact that 
the interaction of peptides p 8'-11' and p 30'-33' with the protein is 
much weaker than that associated with p-LES, we have increased 
the interaction of 8'-11' and 30'-33' with S36, so as to mimic the 
interaction energies typical of p-LES. In particular, in the case of p 
8'-11' the energies of the contacts 8-21, 9'-22', 10-15 and 11'-14' 
were increased to the average value of the contact energy acting 
between two LES. The results are similar to those obtained using 
the original peptide 8'-11', indicating again the lack of destabilizing 
effects (cf. Fig. HZD). 

To test the generality of the results discussed in this section we
have studied a second 36-mer. The interest in this second 36-mer is
due to the fact that it was designed minimizing the number of local
contacts [48]. Since local contacts play an important role in the
hierarchy of folding events, the behaviour of a protein with as few
local contacts as possible is particularly interesting. The primary
structure, the LES, the folding nucleus and the native state
conformation associated with this 36-mer are shown in Fig. 18(a), (b),
(c) and (d), respectively. For this sequence we determined how the
presence of one p-LES affects the stability of the protein. The
results are shown in Fig. 19. The longest p-LES, 1'-6', destabilizes
the protein: the population of the native state drops from ~50%
(without p-LES) to 29.4% in the presence of one p-LES. On the other
hand, the other two p-LES, based on the shorter LES (20'-22' and
30'-31', which are "open" LES according to the definition used in
Sect. 5.2), essentially do not affect the stability of the protein.
The reason for this behaviour can be understood in terms of
specificity: the shorter the p-LES, the higher the probability that it
binds to some part of the protein other than the LES it was designed
for. These p-LES, binding non-specifically to the protein, are not
likely to be effective in denaturing it. Since this particular 36-mer
has been chosen to have few local contacts, it displays some LES which
are "marginal", in the sense that they lie at the borderline (because
of their length and of their "open" character) of being able to behave
like LES.

We have also studied a larger protein, composed of 48 residues. Its
primary structure, its LES, folding nucleus and native state
conformation are displayed in Fig. 20(a), (b), (c) and (d),
respectively. For this designed protein we determined how the presence
of one p-LES affects the stability of the protein. The results are
displayed in Fig. 21 and are in overall agreement with the results
obtained in the study of the 36-mers.

6.2 Dynamical aspects of the inhibiting mechanism

In this section we investigate how the presence of n_p p-LES 3'-6'
affects the dynamics of the folding process of the lattice designed
protein S36. We focus our attention on the dynamics of the formation
of contacts 3-6 and 3-30 of S36, representative, respectively, of the
contacts stabilizing the LES 3-6 and of those existing, in the native
structure, between LES 3-6 and LES 27-30. We also study the
probability, as a function of the number of MC steps, that a LES
interacts with a p-LES. In particular, the interaction between site 29
of S36 and any amino acid belonging to one of the n_p p-LES 3'-6' is
studied as a function of time (Monte Carlo steps), averaging the
contact formation probability over 1000 independent simulations. One
finds that the formation of bond 3-6, i.e. the formation of LES 3-6,
is not affected by the presence of the n_p p-LES 3'-6'. This testifies
to the fact that p-LES do not interfere with the formation of the LES,
a phenomenon which occurs very early in the folding process
(~ 10^ MC steps). On the other hand, the p-LES interfere with the
docking of the LES, that is, with the formation of the folding
nucleus, as is clear from Fig. 22. Furthermore, the time evolution of
the contact between p-LES of type 3'-6' and LES 27-30 of the designed
protein is also found to depend on the number of p-LES: the higher
n_p, the shorter the time needed by a p-LES to bind to the
corresponding LES.
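
The averaging of the contact formation probability over independent
simulations can be illustrated with a minimal Python sketch. It is not
the code used for the calculations reported here: the function name
contact_formation_probability and the toy input are hypothetical, and
each run is assumed to be stored as a list of flags telling whether
the contact of interest (e.g. 3-30) is formed at each sampled MC time.

```python
def contact_formation_probability(runs):
    """Average a contact indicator over independent simulations.

    runs[k][t] is True if the contact is formed at sampled time t
    in run k (an assumed, illustrative data layout).
    Returns a list p with p[t] = fraction of runs with the contact at t.
    """
    if not runs:
        raise ValueError("at least one run is required")
    n_times = len(runs[0])
    if any(len(run) != n_times for run in runs):
        raise ValueError("all runs must have the same number of samples")
    n_runs = len(runs)
    return [sum(1 for run in runs if run[t]) / n_runs for t in range(n_times)]


if __name__ == "__main__":
    # Toy data: 4 runs sampled at 5 times (the text uses 1000 runs).
    toy_runs = [
        [False, False, True, True, True],
        [False, True, True, True, True],
        [False, False, False, True, True],
        [False, False, True, False, True],
    ]
    print(contact_formation_probability(toy_runs))
```

Curves for different values of n_p would be obtained by applying the
same average to the corresponding sets of runs.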

Summing up, one observes that p-LES interact with LES, preventing
their docking to form the folding nucleus. Furthermore, simulations
indicate that the efficiency of p-LES in denaturing the protein does
not depend on the initial conditions of the protein (whether it is in
the native or in an unfolded state).

6.3 Can the system develop resistance? 

One of the main problems related to conventional inhibitor drugs is
the phenomenon of resistance. Proteins displaying mutations, which
arise, as a rule, due to the large inaccuracy of genetic regulation
associated with retroviruses (that is, viruses which carry their
genetic information in a single RNA strand), profit from mutations in
the target site of the drug. This is, as a rule, the active site of
the protein. Consequently, the ability of the protein to fold, and
thus to carry out its biological function, is essentially retained,
while the drug is no longer able to bind effectively to the protein.
This phenomenon is particularly important, for example, in the case of
HIV. One of the main proteins involved in the assembly of the virus
during its replication, and thus a main target of conventional drugs,
the HIV-protease, usually develops drug resistance within about 6-8
months from the start of the therapy (cf. ref. [?] and refs. therein).
Thus, an important question connected with the design of the
non-conventional drugs discussed above is: can the system, when
targeted by the small peptides p-LES, develop resistance? LES are made
up of strongly interacting, as a rule hydrophobic, amino acids
occupying hot and warm sites well protected inside the protein.
Consequently, one expects that mutations in the target site of a p-LES
(i.e., its complementary LES) lead to a denaturation or to a
conspicuous destabilization of the native state of the protein. To
test this expectation the sequence S36 was subjected to a drug-induced
evolutionary pressure. In other words, simulations of the folding of
sequences S36', obtained from S36 by point mutations, were carried out
in the presence of n_p p-LES and the results analyzed. Two possible
outcomes were found: (1) the mutation leads to a complete denaturation
of the protein, thus making it totally inactive; (2) the mutated
sequence still folds to the native, biologically active state,
although the native state is less stable. In the first case p-LES have
no effect on the behaviour of the protein. In the second case they
retain their effectiveness, interfering with the folding process
and with the stability of the protein, very much as they did in the
simulations discussed in Sects. 6.1 and 6.2 (cf. Figs. 23 and 24).

6.4 Perspectives on the use of p-LES as drugs 

We have shown how it is possible to inhibit the activity of a protein
by blocking its folding with the help of small peptides which mimic
the LES. The very reason why LES make the protein fold fast confers on
p-LES the features required for a drug to qualify as such: efficiency
and specificity. p-LES are efficient because they bind to the protein
as strongly as LES bind to each other to form the folding nucleus.
Since LES are responsible for the stability of the protein, their
stabilization energy must be of the order of several times kT. These
peptides are also as specific as LES are, a specificity which LES have
developed over millions of years of evolution in order to avoid both
metastable states and aggregation, as well as to make the protein fold
fast. The possibility of developing non-conventional drugs in actual
situations is tantamount to being able to determine the LES of a given
protein. This can, in principle, be done either experimentally, for
example making use of φ-value analysis (i.e., measuring the relative
change in the free energy between the native state and the transition
state upon mutation: high φ-values are associated with "hot" sites
[?]), or by extending the algorithm discussed in Sect. 5.2 with a
realistic force field. The resulting peptides can be used directly as
drugs, or as templates to build mimetic molecules which possibly do
not display problems connected with digestion or allergies. A feature
which makes these drugs quite promising as compared to conventional
ones is the fact that the target protein would not be able to evolve
through mutations to escape the action of the drug, as happens, e.g.,
in the case of viral proteins, because mutations of residues belonging
to LES would, in any case, lead to protein denaturation.

Figure 1: The atomic structure of a small protein, Chymotrypsin
Inhibitor 2. The dark grey curve highlights the chain structure.

GWVILLGPTHGGACA
AWVINLGPTHGGACA 
LWVILLGPTHGWACA 

VWACHAF SH W Y YCL 
VWSCHPFISHWVYCL 
VWARHAF SH W YRCL 
VWACCAFSHMYYCL 

RWLYTNNKHARWCC 




Figure 2: An example of alignment of sequences (left side) folding to
the same native conformations (right side). Sequences displaying a
high degree of similarity are defined as homologous and are grouped
together in the figure. Pairs of sequences which display little
similarity are defined as analogous (see text). Both analogous and
homologous sequences share, as a rule, a small set of residues which
are highly conserved (and which are indicated with bold characters in
the left side of the figure).

Figure 3: The entropy per site in the space of sequences, defined in
Sect. 3, for the family of analogs of Chymotrypsin Inhibitor 2.

Figure 4: Sketch of the free energy landscape of (a) a random
heteropolymer, (b) a selected protein. The quantity F_c is the free
energy of the lowest conformation of the random heteropolymer, while
F_n is that associated with the native conformation of the protein.
The x-axis corresponds to a generic conformational coordinate. Note
that, although this coordinate is drawn as one-dimensional by
necessity, the conformational space of a protein has a very high
dimensionality.

Figure 5: A sketch of the entropy as a function of energy for a
selected sequence. The energy E_n of the native state lies well below
E_c. The critical temperature T_c is the inverse slope of the tangent
to the entropy at E_c. The folding temperature T_f is the inverse
slope of the line which is tangent to the entropy at the energy E_u
corresponding to the unfolded state and passes through the native
energy E_n (the corresponding entropy being zero by definition).




Figure 6: The hierarchy of folding events for a model protein
consisting of 36 amino acids, whose native state energy is -17.13 in
the appropriate units (RT_room = 0.6 kcal/mol), to be compared with
E_c = -14. Starting from a random coil conformation (a), local
elementary structures are formed very early in the folding process,
that is, after about 10^ Monte Carlo steps (b); next the folding
nucleus forms (c) and the remaining native contacts are formed to
complete the folding process (d). Black, grey and white beads indicate
amino acids occupying hot, warm and cold sites (cf. Sect. 5.1).

Figure 7: Steps by which the sequence shown in Fig. reduces its
entropy. For a random coil there are 10^'* possible conformations. As
the system collapses to a random globule there are 10^® conformations
available. Successively, LES are formed (10^^ conformations). The
formation of the contacts between two of the three LES reduces the
number of available conformations to 10^^, while when the nucleus is
completely formed the system has to search among 10^ conformations.

Figure 8: The primary structure of S36 (a), its LES (b), the folding
nucleus (c) and the native state conformation (d). Note that the amino
acids composing the LES 3-6, 11-14 and 27-30 are KWLE, RIAD and KIME,
respectively. Making use of the MJ contact matrix (cf. the table in
App. C), U_KE = -0.97, U_MW = -0.95, U_LI = -0.41, U_IA = -0.22,
U_ER = -0.74 and U_KD = -0.76, one obtains that the interaction
energies between the LES (3-6)-(27-30), (3-6)-(11-14) and
(27-30)-(11-14) are -1.92, -1.15 and -0.98, respectively.

Figure 9: (a) The distribution of the parameter p(i,j) (cf. Eq. (14)),
whose maximization allows one to find the closed elementary
structures; (b) the distribution of the energy density ε (cf. Eq.
(13)), employed to find open elementary structures; (c) the
distribution of the energies associated with the possible folding
nuclei of sequence S36, built of the elementary structures 3-4-5-6,
11-12-13-14 and 27-28-29-30.

Figure 10: Equilibrium population of S36 folding in the presence of
n_p p-LES 3'-6'.

Figure 11: The relative occupancy of the native state shown in Fig.
10(d) by the sequence S36 as a function of the number n_p of p-LES
3'-6'.

Figure 12: Equilibrium population of S36 in the presence of n_p
27'-30' peptides.




Figure 13: The relative occupancy of the native state shown in Fig.
12(d) by the sequence S36 interacting with n_p p-LES 27'-30'.

Figure 14: Equilibrium population of S36 in the presence of n_p
11'-14' peptides.




Figure 15: The relative occupancy of the native state shown in Fig.
14(d) by the sequence S36 interacting with n_p p-LES 11'-14'.

Figure 16: Equilibrium population of S36 folding in the presence of
n_p p 30'-33'.

Figure 17: Stability of the sequence S36 in the presence of n_p
peptides which interact with the part of the protein with which the
amino acids 8-11 of S36 interact in the native state. The contact
energies of p 8'-11' have been increased so that this interaction is
as strong as the average interaction energy of LES among themselves.

Figure 18: Primary structure of the designed 36-mer (a), its LES (b),
the folding nucleus (c) and the native state conformation (d).

Figure 19: Stability of the protein displayed in Fig. 18 folding in
the presence of n_p = 1 p-LES.

Figure 20: The primary structure of the 48-mer used (a), its LES (b),
the folding nucleus (c) and the native state conformation (d).

Figure 21: Stability of the protein displayed in Fig. 20 folding in
the presence of n_p = 1 p-LES.

Figure 22: Probability for the bond 3-30 of the protein to be formed
during the folding process as a function of the number n_p (= 0, 1, 2,
3, 4 and 5) of p-LES 3'-6'. This contact is taken as representative of
the interaction between LES 3-6 and LES 27-30 as a whole.

Figure 23: The impact of the point mutation K27G on the folding of
S36, where amino acid K, which occupies site number 27, has been
substituted with amino acid G. The native state is completely
destabilized and the activity of the protein is inhibited, in keeping
with the fact that site 27 of S36 is a hot site (cf. Fig. 8(d)). The
presence of n_p = 2 p-LES 3'-6' leaves the population of states
unchanged.



A The Monte Carlo algorithm 

The whole thermodynamical information about a protein chain (e.g.,
the stability of the native state, the folding temperature, etc.) is
contained in the partition function

Z = \sum_{\Gamma} \exp(-E(\Gamma)/T),    (16)

where the sum is performed over all the possible conformations \Gamma
of the system and the energy function E is, in the present model, the
contact energy function of the lattice model. The huge number of
conformations that even a short chain can assume makes the exact
enumeration in Eq. (16) unfeasible by computer, nor is the function
simple enough to be summed analytically.
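
To give a feeling for how quickly the number of terms in Eq. (16)
grows, the following Python sketch counts, by brute force, the
self-avoiding conformations of a short chain on a cubic lattice. It is
only an illustration under simple assumptions (conformations related
by rotations or reflections are counted separately, and the function
name count_conformations is hypothetical); it is not part of the
calculations discussed in the text.

```python
# Exhaustive count of self-avoiding walks on the simple cubic lattice.
STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def count_conformations(n_bonds):
    """Number of self-avoiding chains with n_bonds bonds starting at the origin."""

    def extend(pos, visited, remaining):
        if remaining == 0:
            return 1
        total = 0
        for dx, dy, dz in STEPS:
            nxt = (pos[0] + dx, pos[1] + dy, pos[2] + dz)
            if nxt not in visited:  # enforce self-avoidance
                visited.add(nxt)
                total += extend(nxt, visited, remaining - 1)
                visited.remove(nxt)
        return total

    origin = (0, 0, 0)
    return extend(origin, {origin}, n_bonds)


if __name__ == "__main__":
    # The count grows roughly geometrically with the chain length,
    # which is why a 36-mer cannot be enumerated exhaustively this way.
    for n in range(1, 8):
        print(n, count_conformations(n))
```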

Figure 24: The impact of the point mutation I28S on the folding of
S36. In keeping with the fact that site 28 is a cold site (Fig. 8(d)),
the protein retains its ability to fold, although displaying a less
stable native conformation (compare the continuous curve of this
figure with that of Fig. 10). The drug (p-LES 3'-6') is still
effective in inhibiting the folding of the mutated protein.

The Monte Carlo algorithm [?] is meant to give an estimate of the
partition function through the summation of Eq. (16) only over a
limited set of conformations. If the choice of this set were made
randomly, the algorithm would be rather inefficient (except at high
temperatures), since most of the states display high energy and
consequently the associated exponential is small. To solve this
problem, the Monte Carlo algorithm builds a Markov chain of states of
the system, i.e. an artificial dynamics, whose purpose is to provide
the set over which the partition function is summed. The algorithm
consists of three steps:

1. Choose a random starting conformation of the chain.

2. Perform a random move chosen among a set of permitted elementary
moves.

3. Accept the move with a probability chosen in such a way that the
distribution of states tends, after a large number of moves, towards a
Boltzmann distribution.

Steps (2) and (3) are then repeated a large number of times (each
repetition is usually called a Monte Carlo step, or MC step).

Figure 25: The set of moves used in the Monte Carlo algorithm, known
(from top to bottom) as head/tail move, corner flip and crankshaft,
respectively.

In the present calculations the Metropolis acceptance probability is
chosen, given by

P_a = \begin{cases} 1 & \text{if } \Delta E \le 0, \\ \exp(-\Delta E/T) & \text{if } \Delta E > 0, \end{cases}    (17)

where \Delta E is the change in the energy of the chain caused by the
(possible) move. The set of moves we used is displayed in Fig. 25 and
is composed of (1) the move of the head or the tail of the chain to a
neighbouring site, (2) the flip of a corner conformation and (3) a
crankshaft. In principle, this set (as any set of local moves) does
not make the system ergodic, a feature which is required for the Monte
Carlo algorithm to work, but the subsets of chain conformations which
are disconnected from the rest are so small that we can consider the
system as effectively ergodic.
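
The three steps and the move set described above can be put together
in a short Python sketch. It is a minimal illustration, not the code
used for the calculations of this paper: all function names are
hypothetical, the chain is assumed to have at least four beads, and,
as a loudly stated simplification, the energy is a toy homopolymer
energy (-1 for each pair of non-bonded beads on neighbouring sites)
used here in place of the contact matrix of the model.

```python
import math
import random

NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def add(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def energy(chain):
    """Toy energy (assumption): -1 per pair of non-bonded neighbouring beads."""
    index = {site: i for i, site in enumerate(chain)}
    total = 0.0
    for i, site in enumerate(chain):
        for step in NEIGHBOURS:
            j = index.get(add(site, step))
            if j is not None and j > i + 1:  # count each non-bonded pair once
                total -= 1.0
    return total


def propose_move(chain, rng):
    """Return a trial chain after one local move, or None if it is forbidden."""
    n = len(chain)
    occupied = set(chain)
    new = list(chain)
    kind = rng.choice(("end", "corner", "crankshaft"))
    if kind == "end":  # head/tail move to a neighbour of the adjacent bead
        i, anchor = rng.choice(((0, 1), (n - 1, n - 2)))
        new[i] = add(chain[anchor], rng.choice(NEIGHBOURS))
        moved = [i]
    elif kind == "corner":  # flip bead i to the opposite corner of the square
        i = rng.randrange(1, n - 1)
        new[i] = sub(add(chain[i - 1], chain[i + 1]), chain[i])
        moved = [i]
    else:  # crankshaft: rotate beads i, i+1 about the axis through i-1 and i+2
        i = rng.randrange(1, n - 2)
        axis = sub(chain[i + 2], chain[i - 1])
        if axis not in NEIGHBOURS:
            return None  # beads i-1 and i+2 are not adjacent: no U-shaped segment
        arm = sub(chain[i], chain[i - 1])
        turns = [s for s in NEIGHBOURS
                 if s != arm and s[0] * axis[0] + s[1] * axis[1] + s[2] * axis[2] == 0]
        new_arm = rng.choice(turns)
        new[i] = add(chain[i - 1], new_arm)
        new[i + 1] = add(chain[i + 2], new_arm)
        moved = [i, i + 1]
    if any(new[k] in occupied for k in moved):
        return None  # target site occupied (or the bead did not move)
    return new


def metropolis_accept(delta_e, temperature, rng):
    """Metropolis rule of Eq. (17)."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)


def run(chain, temperature, n_steps, seed=0):
    """Repeat steps (2) and (3) n_steps times; return final chain and energy."""
    rng = random.Random(seed)
    chain = list(chain)
    e = energy(chain)
    for _ in range(n_steps):
        trial = propose_move(chain, rng)
        if trial is None:
            continue  # forbidden move: the chain stays where it is
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e, temperature, rng):
            chain, e = trial, e_trial
    return chain, e


if __name__ == "__main__":
    start = [(i, 0, 0) for i in range(12)]  # elongated starting conformation
    final, e_final = run(start, temperature=1.0, n_steps=20000, seed=1)
    print(e_final, final)
```

Each proposal is symmetric (the reverse move is proposed with the same
probability), so that the Metropolis rule drives the chain towards the
Boltzmann distribution of the toy energy; replacing energy() with a
sequence-dependent contact energy would not change the structure of
the loop.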

Apart from describing the thermodynamics of the system, it has been
shown that the Monte Carlo algorithm also provides a reliable
description of the dynamics if the set of moves employed is local
[50]. In fact, it was shown [51] that the trajectories resulting from
a Monte Carlo simulation with local moves constitute the solution of
a diffusive Fokker-Planck equation, which, in turn, is equivalent to
Langevin dynamics.